Abstract

Object detection systems based on the deep convolutional neural network (CNN) have recently
made groundbreaking advances on several object detection benchmarks. While the features
learned by these high-capacity neural networks are discriminative for categorization,
inaccurate localization is still a major source of error for detection. Building upon
high-capacity CNN architectures, we address the localization problem by 1) using a search
algorithm based on Bayesian optimization that sequentially proposes candidate regions for an
object bounding box, and 2) training the CNN with a structured loss that explicitly penalizes
the localization inaccuracy. In experiments, we demonstrate that each of the proposed methods
improves the detection performance over the baseline method on the PASCAL VOC 2007 and 2012
datasets. Furthermore, the two methods are complementary and, when combined, significantly
outperform the previous state of the art.

1. Introduction 

Object detection is one of the long-standing and important problems in computer vision.
Motivated by the recent success of deep learning [27, 21, 3, 6, 28, 4, 36] on visual object
recognition tasks [26, 41, 49, 42, 45], significant improvements have been made in the object
detection problem [44, 11, 20]. Most notably, Girshick et al. [18] proposed the “regions with
convolutional neural network” (R-CNN) framework for object detection and demonstrated
state-of-the-art performance on standard detection benchmarks (e.g., PASCAL VOC [12, 13],
ILSVRC [35]) by a large margin over previous methods, which are mostly based on the
deformable part model (DPM) [15].

There are two major keys to the success of the R-CNN. First, features matter [18]. In the
R-CNN, the low-level image features (e.g., HOG [8]) are replaced with the CNN features, which
are arguably more discriminative representations. One drawback of CNN features, however, is
that they are expensive to compute. The R-CNN overcomes this issue by proposing a few hundred
or a few thousand candidate bounding boxes via the selective search algorithm [47], which
effectively reduces the computational cost that would be required to evaluate the detection
scores at all regions of an image.

Despite the success of R-CNN, it has been pointed out through an error analysis [22] that
inaccurate localization causes the most egregious errors in the R-CNN framework [18]. For
example, if there is no bounding box in the close proximity of the ground truth among those
proposed by selective search, then no matter what features or classifiers we use, there is no
way to detect the correct bounding box of the object. Indeed, there are many applications
that require accurate localization of an object bounding box, such as detecting moving
objects (e.g., cars, pedestrians, bicycles) for autonomous driving [17], detecting objects
for robotic grasping or manipulation in robotic surgery or manufacturing [29], and many
others.

In this work, we address the localization difficulty of the R-CNN detection framework with
two ideas. First, we develop a fine-grained search algorithm to expand an initial set of
bounding boxes by proposing new bounding boxes with scores that are likely to be higher than
the initial ones. By doing so, even if the initial region proposals were poor, the algorithm
can find a region that is getting closer to the ground truth after a few iterations. We build
our algorithm in the Bayesian optimization framework [32, 43], where evaluation of the
complex detection function is replaced with queries from a probabilistic distribution of the
function values defined with a computationally efficient surrogate model. Second, we train a
CNN classifier with a structured SVM objective that aims at classification and localization
simultaneously. We define the structured SVM objective function with a hinge loss that
balances between classification (i.e., determines whether an object exists) and localization
(i.e., determines how much it overlaps with the ground truth) to be used as the last layer of
the CNN.

In experiments, we evaluated our methods on the PASCAL VOC 2007 and 2012 detection tasks and
compared them to other competing methods. We demonstrated significantly improved performance
over the state of the art at different levels of intersection over union (IoU) criteria. In
particular, our proposed method outperforms previous methods by a large margin at higher IoU
criteria (e.g., IoU = 0.7), which highlights the good localization ability of our method.

Overall, the contributions of this paper are as follows:

1) we develop a Bayesian optimization framework that can find more accurate object bounding
boxes without significantly increasing the number of bounding box proposals,

2) we develop a structured SVM framework to train a CNN classifier for accurate localization,

3) the aforementioned methods are complementary and can be easily adopted to various CNN
models, and finally,

4) we demonstrate significant improvement in detection performance over the R-CNN on both
PASCAL VOC 2007 and 2012 benchmarks.

2. Related work 

The DPM [15] and its variants [33, 16] have been the dominating methods for object detection
tasks for years. These methods use image descriptors such as HOG [8], SIFT [31], and LBP [1]
as features and densely sweep through the entire image to find a maximum response region.
With the notable success of CNNs on large-scale object recognition [26], several detection
methods based on CNNs have been proposed [41, 40, 44, 11, 18]. Following the traditional
sliding window method for region proposal, Sermanet et al. [41] proposed to search
exhaustively over an entire image using CNNs, but made it efficient by conducting a
convolution on the entire image at once at multiple scales. Apart from the sliding window
method, Szegedy et al. [44] used CNNs to regress the bounding boxes of objects in the image
and used another CNN classifier to verify whether the predicted boxes contain objects.
Girshick et al. [18] proposed the R-CNN following the “recognition using regions”
paradigm [19], which also inspired several previous state-of-the-art methods [47, 48]. In
this framework, a few hundred or a few thousand regions are proposed for an image via the
selective search algorithm [47] and the CNN is finetuned with these region proposals. Our
method is built upon the R-CNN framework using the CNN proposed in [42], but with 1) a novel
method to propose extra bounding boxes in the case of poor localization, and 2) a classifier
with improved localization sensitivity.

The structured SVM objective function in our work is inspired by Blaschko and Lampert [5],
who trained a kernelized structured SVM on low-level visual features (i.e., HOG [8]) to
predict the object location. Schulz and Behnke [39] integrated a structured objective with
the deep neural network for object detection, but they adopted the branch-and-bound strategy
for training as in [5]. In our work, we formulate the linear structured objective upon
high-level features learned by deep CNN architectures, but our negative mining step is very
efficient thanks to the region-based detection framework. We also present a gradient-based
optimization method for training our architecture.

There have been several other related works on accurate object localization. Fidler et
al. [16] incorporated the geometric consistency of bounding boxes with bottom-up segmentation
as auxiliary features into the DPM. Dai and Hoiem [7] used the structured SVM with color and
edge features to refine the bounding box coordinates in the DPM framework. Schulter et
al. [38] used the height prior of an object. These auxiliary features to aid object
localization can be injected into our framework without modifications.

Localization refinement can also be treated as a CNN regression problem. Girshick et al. [18]
extracted the middle layer features and linearly regressed the initially proposed regions to
better locations. Sermanet et al. [41] refined bounding boxes from a grid layout to flexible
locations and sizes using the higher layers of the deep CNN architecture. Erhan et al. [11]
jointly conducted classification and regression in a single architecture. Our method is
different in that 1) it uses the information from multiple existing regions instead of a
single bounding box for predicting a new candidate region, and 2) it focuses only on
maximizing the localization ability of the CNN classifier instead of doing any regression
from one bounding box to another.

3. Fine-grained search for bounding box via Bayesian optimization

Let $f(x, y)$ denote a detection score of an image $x$ at the region with the box coordinates
$y = (u_1, v_1, u_2, v_2) \in \mathcal{Y}$. The object detection problem deals with finding
the local maximum of $f(x, y)$ with respect to $y$ of an unseen image $x$.^1 As it requires
an evaluation of the score function at many possible regions, it is crucial to have an
efficient algorithm to search for the candidate bounding boxes.

A sliding window method has been used as a dominant search algorithm [8, 15], which
exhaustively searches over an entire image with fixed-sized windows at different scales to
find a bounding box with a maximum score. However, evaluating the score function at all
regions determined by the sliding window approach is prohibitively expensive when the CNN
features are used as the image region descriptor. The problem becomes more severe when
flexible aspect ratios are needed for handling object shape variations. Alternatively, the
“recognition using regions” [19, 18] method has been proposed, which requires evaluating
significantly fewer regions (e.g., a few hundred or a few thousand) with different scales and
aspect ratios, and it can use the state-of-the-art image features with high computational
complexity, such as the CNN features [10]. One potential issue of object detection pipelines
based on region proposal is that the correct detection will not happen when there is no
region proposed in the proximity of the ground truth bounding box.^2 To resolve this issue,
one can propose more bounding boxes to cover the entire image more densely, but this would
significantly increase the computational cost. In this section, we develop a fine-grained
search (FGS) algorithm based on Bayesian optimization that sequentially proposes a new
bounding box with a higher expected detection score than previously proposed bounding boxes
without significantly increasing the number of region proposals. We first present the general
Bayesian optimization framework (Section 3.1) and describe the FGS algorithm using a Gaussian
process as the prior for the score function (Section 3.2). We then present the local FGS
algorithm that searches over multiple local regions instead of a single global region
(Section 3.3), and discuss the hyperparameter learning of our FGS algorithm (Section 3.4).

^1 When multiple (including zero) objects exist, it involves finding the local maxima that
exceed a certain threshold.

Figure 1: Pipeline of our method. 1) Initial bounding boxes are given by methods such as the
selective search [47] and their detection scores are obtained from the CNN-based classifier
trained with the structured SVM objective. 2) The box(es) with optimal score(s) in the local
regions are found by greedy NMS [18] (shown as “local optimum” boxes), and Bayesian
optimization takes the neighborhood of each local optimum to propose a new box with a high
chance of getting a better detection score. 3) We evaluate the detection score of the new box
and take it as an observation for the next iteration of the Bayesian optimization, repeating
until convergence. Note that the local optimum and the search region may change in each
iteration. 4) All the bounding boxes are fed into the standard post-processing stage (e.g.,
thresholding and NMS, etc.).

3.1. General Bayesian optimization framework

Let $\{y_1, \cdots, y_N\}$ be the set of solutions (e.g., bounding boxes). In the Bayesian
optimization framework, $f = f(x, y)$ is assumed to be drawn from a probabilistic model:

$$p(f \mid \mathcal{D}_N) \propto p(\mathcal{D}_N \mid f)\, p(f), \tag{1}$$

where $\mathcal{D}_N = \{(y_j, f_j)\}_{j=1}^{N}$ and $f_j = f(x, y_j)$. Here, the goal is to
find a new solution $y_{N+1}$ that maximizes the chance of improving the detection score
$f_{N+1}$, where the chance is often defined as an acquisition function
$a(y_{N+1} \mid \mathcal{D}_N)$. Then, the algorithm proceeds by recursively sampling a new
solution $y_{N+t}$ from $\mathcal{D}_{N+(t-1)}$ and updating the set
$\mathcal{D}_{N+t} = \mathcal{D}_{N+(t-1)} \cup \{y_{N+t}\}$ to draw a new sample solution
$y_{N+(t+1)}$ with an updated observation.

Bayesian optimization is efficient in terms of the number of function evaluations [25], and
is particularly effective when $f$ is computationally expensive. When
$a(y_{N+1} \mid \mathcal{D}_N)$ is much less expensive than $f$ to evaluate, and the
computation for $\arg\max_{y_{N+1}} a(y_{N+1} \mid \mathcal{D}_N)$ requires only a few
function evaluations, we can efficiently find a solution that is getting closer to the ground
truth.

3.2. Efficient region proposal via GP regression

A Gaussian process (GP) defines a prior distribution $p(f)$ over the function
$f : \mathcal{Y} \to \mathbb{R}$. Due to this property, a distribution over $f$ is fully
specified by a mean function $m : \mathcal{Y} \to \mathbb{R}$ and a positive definite
covariance kernel $k : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, i.e.,
$f \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot))$. Specifically, for a finite set
$\{y_j\}_{j=1}^{N} \subset \mathcal{Y}$, the random vector $[f_j]_{1 \le j \le N}$ follows a
multivariate Gaussian distribution
$\mathcal{N}([m(y_j)]_{1 \le j \le N}, [k(y_i, y_j)]_{1 \le i,j \le N})$. A random Gaussian
noise with precision $\beta$ is usually added to each $f_j$ independently in practice. Here,
we used the constant mean function $m(y) = m_0$ and the squared exponential covariance kernel
with automatic relevance determination (SEard) as follows:

$$k_{\mathrm{SEard}}(y_i, y_j; z) = \eta \exp\left\{ -\frac{1}{2} \left(\Psi_z(y_i) - \Psi_z(y_j)\right)^{\top} \Lambda \left(\Psi_z(y_i) - \Psi_z(y_j)\right) \right\},$$

where $\Lambda$ is a 4 x 4 diagonal matrix whose diagonal entries are $\lambda_i^2$,
$i = 1, \cdots, 4$. These form a 7-dimensional GP hyperparameter
$\theta = (\beta, m_0, \eta, \lambda_1^2, \lambda_2^2, \lambda_3^2, \lambda_4^2)$ to be
learned from the training data. $\Psi_z : \mathcal{Y} \to \mathbb{R}^4$ transforms the
bounding box coordinates $y$ into a new form:

$$\Psi_z(y) = \left[ \frac{\bar{u}}{\exp(z)},\; \frac{\bar{v}}{\exp(z)},\; \log w,\; \log h \right]^{\top}, \tag{2}$$

where $\bar{u} = (u_1 + u_2)/2$ and $\bar{v} = (v_1 + v_2)/2$ denote the center coordinates,
$w = u_2 - u_1$ denotes the width, and $h = v_2 - v_1$ denotes the height of a bounding box.
We introduce a latent variable $z$ to make the covariance kernel scale-invariant.^3 We
determine $z$ in a data-driven manner by maximizing the marginal likelihood of
$\mathcal{D}_N$ (which depends on $z$ through the kernel), or

$$\hat{z} = \arg\max_{z}\; p(\{f_j\}_{j=1}^{N} \mid \{y_j\}_{j=1}^{N}; \theta). \tag{3}$$

^2 We refer to selective search as a representative method for region proposal.

^3 If the image and the bounding boxes $y_i, y_j$ are scaled down by a certain factor, we can
keep $k_{\mathrm{SEard}}(y_i, y_j; z)$ invariant by properly setting $z$.
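To make the box transform and the role of the latent scale $z$ concrete, here is a minimal Python sketch. It assumes the standard squared-exponential ARD form $\eta \exp(-\frac{1}{2} d^{\top} \Lambda d)$ with $d = \Psi_z(y_i) - \Psi_z(y_j)$ and $\Lambda = \mathrm{diag}(\lambda_1^2, \ldots, \lambda_4^2)$; the function names (`psi`, `k_seard`, `scale_box`) and the example boxes are illustrative assumptions, not the authors' implementation.

```python
import math

def psi(box, z):
    """Map a box (u1, v1, u2, v2) to [u_c / exp(z), v_c / exp(z), log w, log h]."""
    u1, v1, u2, v2 = box
    scale = math.exp(z)
    return [
        0.5 * (u1 + u2) / scale,  # horizontal center, normalized by exp(z)
        0.5 * (v1 + v2) / scale,  # vertical center, normalized by exp(z)
        math.log(u2 - u1),        # log width
        math.log(v2 - v1),        # log height
    ]

def k_seard(box_i, box_j, z, eta, lam_sq):
    """Assumed SEard form: eta * exp(-0.5 * sum_d lam_sq[d] * diff_d ** 2)."""
    p_i, p_j = psi(box_i, z), psi(box_j, z)
    quad = sum(l * (a - b) ** 2 for l, a, b in zip(lam_sq, p_i, p_j))
    return eta * math.exp(-0.5 * quad)

def scale_box(box, s):
    """Scale every coordinate of a box by the factor s (image resized by s)."""
    return tuple(s * c for c in box)

# Hypothetical boxes and hyperparameters, chosen only for illustration.
a = (10.0, 20.0, 110.0, 220.0)
b = (30.0, 25.0, 120.0, 200.0)
lam_sq = [1.0, 1.0, 1.0, 1.0]
k_full = k_seard(a, b, z=4.0, eta=1.0, lam_sq=lam_sq)
# Halving the image and shifting z by log(0.5) leaves the kernel value unchanged.
k_half = k_seard(scale_box(a, 0.5), scale_box(b, 0.5),
                 z=4.0 + math.log(0.5), eta=1.0, lam_sq=lam_sq)
```

The log-width and log-height differences are unaffected by a uniform rescaling, so only the two center terms need the compensation provided by $z$.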

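The GP regression (GPR) problem tries to find a new argument $y_{N+1}$ given $N$ observations
$\mathcal{D}_N$ that maximizes the value of the acquisition function, which, in our case, is
defined with the expected improvement (EI) as:

$$a_{\mathrm{EI}}(y_{N+1} \mid \mathcal{D}_N) = \int_{\hat{f}_N}^{\infty} (f - \hat{f}_N) \cdot p(f \mid y_{N+1}, \mathcal{D}_N; \theta)\, df, \tag{4}$$

where $\hat{f}_N = \max_{1 \le j \le N} f_j$. The posterior of $f_{N+1}$ given
$(y_{N+1}, \mathcal{D}_N)$ follows a Gaussian distribution:

$$p(f_{N+1} \mid y_{N+1}, \mathcal{D}_N; \theta) = \mathcal{N}\left(f_{N+1};\; \mu(y_{N+1} \mid \mathcal{D}_N),\; \sigma^2(y_{N+1} \mid \mathcal{D}_N)\right), \tag{5}$$

with the following mean function and covariance kernels:

$$\mu(y_{N+1} \mid \mathcal{D}_N) = m_0 + \mathbf{k}_{N+1}^{\top} \mathbf{K}_N^{-1} [f_j - m_0]_{1 \le j \le N},$$
$$\sigma^2(y_{N+1} \mid \mathcal{D}_N) = k_{N+1} - \mathbf{k}_{N+1}^{\top} \mathbf{K}_N^{-1} \mathbf{k}_{N+1},$$
$$k_{N+1} = \beta^{-1} + k(y_{N+1}, y_{N+1}),$$
$$\mathbf{k}_{N+1} = [k(y_{N+1}, y_j)]_{1 \le j \le N},$$
$$\mathbf{K}_N = [k(y_i, y_j)]_{1 \le i,j \le N} + \beta^{-1} \mathbf{I}.$$

We refer to [34] for a detailed derivation. By plugging (5) into (4),

$$a_{\mathrm{EI}}(y_{N+1} \mid \mathcal{D}_N) = \sigma(y_{N+1} \mid \mathcal{D}_N) \times \left( \gamma(y_{N+1})\, \Phi(\gamma(y_{N+1})) + \mathcal{N}(\gamma(y_{N+1}); 0, 1) \right), \tag{6}$$

where $\gamma(y_{N+1}) = \sigma(y_{N+1} \mid \mathcal{D}_N)^{-1} \left(\mu(y_{N+1} \mid \mathcal{D}_N) - \hat{f}_N\right)$
and $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution
$\mathcal{N}(\cdot)$.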
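The closed-form expected improvement can be sketched in a few lines of Python. The block below is an illustration under stated assumptions rather than the authors' code: it uses the standard GP posterior with constant mean $m_0$ and noise precision $\beta$, takes the kernel as an arbitrary callable, and solves the linear systems with a small Gaussian-elimination helper (`solve`, a hypothetical name) so that only the standard library is needed.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(kernel, ys, fs, y_new, m0, beta):
    """Posterior mean and variance of the noisy score at y_new."""
    n = len(ys)
    # K_N includes the noise term 1/beta on its diagonal.
    K = [[kernel(ys[i], ys[j]) + (1.0 / beta if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_vec = [kernel(y_new, yj) for yj in ys]
    alpha = solve(K, [f - m0 for f in fs])
    mu = m0 + sum(k * a for k, a in zip(k_vec, alpha))
    v = solve(K, k_vec)
    var = 1.0 / beta + kernel(y_new, y_new) - sum(k * vi for k, vi in zip(k_vec, v))
    return mu, var

def expected_improvement(mu, var, f_best):
    """EI = sigma * (gamma * Phi(gamma) + N(gamma; 0, 1))."""
    sigma = math.sqrt(max(var, 1e-12))  # guard against tiny negative round-off
    gamma = (mu - f_best) / sigma
    cdf = 0.5 * (1.0 + math.erf(gamma / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * gamma * gamma) / math.sqrt(2.0 * math.pi)
    return sigma * (gamma * cdf + pdf)
```

A candidate box would be scored by calling `gp_posterior` with the current local observations and passing the result, together with the best observed score, to `expected_improvement`; the candidate with the largest value is the next one to evaluate with the CNN.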

3.3. Local fine-grained search 

In this section, we extend the GPR-based algorithm for 
global maximum search to local fine-grained search (FGS). 

The local FGS steps are described in Figure 1. We perform the FGS by pruning out easy
negatives with low classification scores from the set of regions proposed by the selective
search algorithm and sorting out a few bounding boxes with the maximum scores in local
regions. Then, for each local optimum $y_{\mathrm{best}}$ (shown as “local optimum” boxes in
Figure 1), we propose a new candidate bounding box (green boxes in Figure 1). Specifically,
we initialize a set of local observations $\mathcal{D}_{\mathrm{local}}$ for
$y_{\mathrm{best}}$ from the set given by the selective search algorithm, whose localness is
measured by an IoU between $y_{\mathrm{best}}$ and region proposals (yellow boxes^4 in
Figure 1). $\mathcal{D}_{\mathrm{local}}$ is used to fit a GP model, and the procedure is
iterated for each local optimum at different levels of IoU until there is no more acceptable
proposal. We provide a pseudocode of local FGS in Algorithm 1, where the parameters are set
as: $t_{\max} = 8$, $(\rho_r)_{r=1}^{3} = (0.3, 0.5, 0.7)$.

^4 In practice, the local search region associated with $\mathcal{D}_{\mathrm{local}}$ is not
a rectangular region around the local optimum since we use IoU to determine it.

Algorithm 1 Local fine-grained search (FGS)

Require: Image $x$, classifier $f_{\mathrm{CNN}}$, a set of structured labels and
    classification scores $\mathcal{D}_N = \{(y_j, f_j)\}_{j=1}^{N}$, GP hyperparameter
    $\theta$, maximum number of GP iterations $t_{\max}$, a threshold $f_{\mathrm{prune}}$ to
    prune out the bounding boxes, different levels of IoU $\rho_r$, $r = 1, \ldots, R$
    determining the size of local regions.
Ensure: A set of structured labels and classification scores $\mathcal{D}$.

 1: $\mathcal{D} \leftarrow \mathcal{D}_N$
 2: for $t = 1, \cdots, t_{\max}$ do
 3:     $\mathcal{D}_{\mathrm{proposal}} = \emptyset$
 4:     $\mathcal{D}_{\mathrm{prune}} = \{(y, f) \in \mathcal{D} : f > f_{\mathrm{prune}}\}$
 5:     $\mathcal{D}_{\mathrm{NMS}} = \mathrm{NMS}(\mathcal{D}_{\mathrm{prune}})$
 6:     for each $(y_{\mathrm{best}}, f_{\mathrm{best}}) \in \mathcal{D}_{\mathrm{NMS}}$ do
 7:         for $r = 1, \cdots, R$ do
 8:             $\mathcal{D}_{\mathrm{local}} = \{(y, f) \in \mathcal{D}_{\mathrm{prune}} : \mathrm{IoU}(y, y_{\mathrm{best}}) > \rho_r\}$
 9:             $\hat{z} = \arg\max_z p(\mathcal{D}_{\mathrm{local}}; \theta)$ (Equation (3))
10:             $\hat{y} = \arg\max_y a_{\mathrm{EI}}(y \mid \mathcal{D}_{\mathrm{local}}; \theta, \hat{z})$ (Equation (6))
11:             $\hat{f} = f_{\mathrm{CNN}}(x, \hat{y})$
12:             $\mathcal{D}_{\mathrm{proposal}} \leftarrow \mathcal{D}_{\mathrm{proposal}} \cup \{(\hat{y}, \hat{f})\}$
13:         end for
14:     end for
15:     $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_{\mathrm{proposal}}$
16: end for

In addition to the capability of handling multiple objects in a single image, better
computational efficiency is another factor making local FGS preferable to global search. As
GPR is a kernel method, its computational complexity grows cubically with the number of
observations. By restricting the observation set to the nearby region of a local optimum, the
GP fitting and proposal process can be performed efficiently. In practice, FGS introduces
less than 20% computational overhead compared to the original R-CNN. Please see the
appendices, which are also available in our technical report [50], for more details on its
practical efficiency (Appendix A4).
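The pruning, NMS, and local-region steps of Algorithm 1 can be sketched as follows. This is a minimal illustration: the box format `(u1, v1, u2, v2)`, the function names, and the NMS overlap threshold of 0.3 are assumptions made for the example, and the GP fitting and proposal steps are left out.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (u1, v1, u2, v2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def greedy_nms(observations, overlap=0.3):
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept one."""
    kept = []
    for y, f in sorted(observations, key=lambda o: o[1], reverse=True):
        if all(iou(y, y_kept) <= overlap for y_kept, _ in kept):
            kept.append((y, f))
    return kept

def local_sets(observations, f_prune, rhos=(0.3, 0.5, 0.7)):
    """For each local optimum and IoU level, collect the local observation set."""
    pruned = [(y, f) for y, f in observations if f > f_prune]
    result = []
    for y_best, _ in greedy_nms(pruned):
        for rho in rhos:
            local = [(y, f) for y, f in pruned if iou(y, y_best) > rho]
            result.append((y_best, rho, local))
    return result

# Hypothetical scored proposals: (box, classification score).
obs = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 11, 11), 0.8),
    ((50, 50, 60, 60), 0.7),
    ((0, 0, 100, 100), -2.0),  # easy negative, removed by pruning
]
sets = local_sets(obs, f_prune=-1.0)
```

Each returned triple corresponds to one pass through the inner loop; in the full algorithm a GP would be fitted to `local` and one new box proposed per triple. Higher IoU levels yield smaller observation sets, which is what keeps the cubic cost of GP regression manageable.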

3.4. Learning GP hyperparameter 

As we locally perform the FGS, the GP hyperparameter $\theta$ also needs to be trained with
observations in the vicinity of ground truth objects. To this end, for an annotated object in
the training set, we form a set of observations with the structured labels and corresponding
classification scores of the bounding boxes that are close to the ground truth bounding box.
Such an observation set is composed of the bounding boxes (given by selective search and
random selection) whose IoU with the ground truth exceeds a certain threshold. Finally, we
fit a GP model by maximizing the joint likelihood of such observations:

$$\hat{\theta} = \arg\max_{\theta} \sum_{i \in I_{\mathrm{pos}}} \log p\left(\{(y, f) : y \in \tilde{\mathcal{Y}}_i,\; f = f(x_i, y)\}; \theta\right),$$

where $I_{\mathrm{pos}}$ is the index set for positive training samples (i.e., with ground
truth object annotations), and $y_i$ is a ground truth annotation of an image $x_i$.^5 We set
$\tilde{\mathcal{Y}}_i = \{y \in \{y_i\} \cup \mathcal{Y}_i \cup \bar{\mathcal{Y}}_i : \mathrm{IoU}(y, y_i) > \rho\}$,
where $\mathcal{Y}_i$ consists of the bounding boxes given by selective search on $x_i$,
$\bar{\mathcal{Y}}_i$ is a random subset of $\mathcal{Y}$, and $\rho$ is the overlap
threshold. The optimal solution $\hat{\theta}$ can be obtained via L-BFGS. Our implementation
relies on the GPML toolbox [34].
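The quantity being maximized is a sum of Gaussian log marginal likelihoods, one per annotated object. The sketch below only evaluates that objective for given hyperparameters; it does not perform the L-BFGS search and is not the GPML implementation. The helper names and the pure-Python Cholesky factorization are assumptions made to keep the example self-contained.

```python
import math

def cholesky(A):
    """Lower-triangular L with L L^T = A for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def log_marginal_likelihood(K, fs, m0):
    """log N(f; m0 * 1, K), where K already contains the noise term."""
    n = len(fs)
    L = cholesky(K)
    r = [f - m0 for f in fs]
    a = []  # forward substitution: solve L a = r
    for i in range(n):
        a.append((r[i] - sum(L[i][k] * a[k] for k in range(i))) / L[i][i])
    quad = sum(v * v for v in a)                       # r^T K^{-1} r
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * logdet - 0.5 * n * math.log(2.0 * math.pi)

def joint_log_likelihood(object_sets, kernel, m0, beta):
    """Sum the log marginal likelihood over per-object observation sets."""
    total = 0.0
    for ys, fs in object_sets:
        n = len(ys)
        K = [[kernel(ys[i], ys[j]) + (1.0 / beta if i == j else 0.0)
              for j in range(n)] for i in range(n)]
        total += log_marginal_likelihood(K, fs, m0)
    return total
```

An optimizer would call `joint_log_likelihood` repeatedly with different kernel parameters, `m0`, and `beta`, keeping the setting with the largest value.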

4. Learning R-CNN with structured loss 

This section describes a training algorithm of R-CNN for object detection using structured
loss. We first revisit the object detection framework with structured output regression
introduced by Blaschko and Lampert [5] in Section 4.1, and extend it to the R-CNN pipeline
that allows training the network with structured hinge loss in Section 4.2.

4.1. Structured output regression for detection 

Let $\{x_1, x_2, \ldots, x_M\}$ be the set of training images and
$\{y_1, y_2, \ldots, y_M\}$ be the set of corresponding structured labels. The structured
label $y \in \mathcal{Y}$ is composed of 5 elements $(l, u_1, v_1, u_2, v_2)$; when $l = 1$,
$(u_1, v_1)$ and $(u_2, v_2)$ denote the top-left and bottom-right coordinates of the object,
respectively, and when $l = -1$, it implies that there is no object in $x_i$, and the
coordinate elements $(u_1, v_1, u_2, v_2)$ carry no meaning. Note that the definition of $y$
is extended from Section 3 to indicate the presence of an object ($l$) as well as its
location $(u_1, v_1, u_2, v_2)$ when it exists. When there are multiple objects in an image,
we crop an image into multiple positive ($l = 1$) images, each of which contains a single
object, and a negative image ($l = -1$) that doesn’t contain any object.^6 Let $\phi(x, y)$
represent the feature extracted from an image $x$ for a label $y$ with $l = 1$. In our case,
$\phi(x, y)$ denotes the top-layer representations of the CNN (excluding the classification
layer) at the location specified by $y$,^7 which are fed into the classification layer. The
detection problem is to find a structured label $y \in \mathcal{Y}$ that has the highest
score:

$$g(x; w) = \arg\max_{y \in \mathcal{Y}} f(x, y; w), \tag{7}$$

where

$$f(x, y; w) = w^{\top} \tilde{\phi}(x, y), \tag{8}$$

$$\tilde{\phi}(x, y) = \begin{cases} \phi(x, y), & l = 1, \\ 0, & l = -1. \end{cases} \tag{9}$$

Note that (9) includes a trick for setting the detection threshold to 0. The model parameter
$w$ is trained to minimize the structured loss $\Delta(\cdot, \cdot)$ between the predicted
label $g(x_i; w)$ and the ground-truth label $y_i$:

$$\hat{w} = \arg\min_{w} \sum_{i=1}^{M} \Delta(g(x_i; w), y_i). \tag{10}$$

For the detection problem, the structured loss $\Delta(y, y_i)$ is defined in terms of
intersection over union (IoU) of two bounding boxes defined by $y$ and $y_i$ as follows:

$$\Delta(y, y_i) = \begin{cases} 1 - \mathrm{IoU}(y, y_i), & \text{if } l = l_i = 1, \\ 0, & \text{if } l = l_i = -1, \\ 1, & \text{if } l \neq l_i, \end{cases} \tag{11}$$

where $\mathrm{IoU}(y, y_i) = \mathrm{Area}(y \cap y_i) / \mathrm{Area}(y \cup y_i)$. In
general, the optimization problem (10) is difficult to solve due to the complicated form of
the structured loss. Instead, we formulate the surrogate objective in the structured SVM
framework [46] as follows:

$$\min_{w} \frac{1}{2} \|w\|^2 + \frac{C}{M} \sum_{i=1}^{M} \xi_i, \quad \text{subject to} \tag{12}$$

$$w^{\top} \tilde{\phi}(x_i, y_i) \ge w^{\top} \tilde{\phi}(x_i, y) + \Delta(y, y_i) - \xi_i, \quad \forall y \in \mathcal{Y}, \forall i, \tag{13}$$

$$\xi_i \ge 0, \quad \forall i. \tag{14}$$

Using (9) and (11), the constraint (13) is written as follows:

$$w^{\top} \phi(x_i, y_i) \ge w^{\top} \phi(x_i, y) + \Delta^{\mathrm{loc}}(y, y_i) - \xi_i, \quad \forall y \in \mathcal{Y}^{(l=1)}, \forall i \in I_{\mathrm{pos}}, \tag{15}$$

$$w^{\top} \phi(x_i, y_i) \ge 1 - \xi_i, \quad \forall i \in I_{\mathrm{pos}}, \tag{16}$$

$$w^{\top} \phi(x_i, y) \le -1 + \xi_i, \quad \forall y \in \mathcal{Y}^{(l=1)}, \forall i \in I_{\mathrm{neg}}, \tag{17}$$

where $\Delta^{\mathrm{loc}}(y, y_i) = 1 - \mathrm{IoU}(y, y_i)$, $I_{\mathrm{pos}}$ and
$I_{\mathrm{neg}}$ denote the sets of indices for positive and negative training examples,
respectively, and $\mathcal{Y}^{(l=1)} = \{y \in \mathcal{Y} : l = 1\}$.

^5 We assumed one object per image. See Section 4.2 for handling multiple objects in
training.

^6 We also perform the same procedure for images with a single object during the training.

^7 Following [18], we crop and warp the image patch of $x$ at the location given by $y$ to a
fixed size (e.g., 224x224) to compute the CNN features.
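As a concrete reading of the structured loss in (11), the following sketch evaluates it for labels stored as tuples `(l, u1, v1, u2, v2)`. The tuple layout and function names are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (u1, v1, u2, v2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def structured_loss(y, y_true):
    """Delta(y, y_true) for structured labels (l, u1, v1, u2, v2)."""
    l, l_true = y[0], y_true[0]
    if l != l_true:
        return 1.0   # object predicted where none exists, or the reverse
    if l == -1:
        return 0.0   # both agree that no object is present
    return 1.0 - iou(y[1:], y_true[1:])  # localization error for positives

# Hypothetical labels: a ground truth box and a prediction covering its top half.
ground_truth = (1, 0, 0, 10, 10)
prediction = (1, 0, 0, 10, 5)
loss = structured_loss(prediction, ground_truth)
```

The loss is 0 for a perfectly localized positive, grows as the overlap shrinks, and saturates at 1 for non-overlapping boxes, which is the same penalty as a wrong presence label.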

4.2. Gradient-based learning of R-CNN with structured SVM objective

To learn the R-CNN detector with structured loss, we propose to make several modifications to
the original structured SVM formulation. First, we restrict the output space
$\mathcal{Y}_i \subset \mathcal{Y}$ of the $i$th example to regions proposed via selective
search. This results in a change in notation for every $\mathcal{Y}$ in (15) and (17) of the
$i$th example to $\mathcal{Y}_i$. Second, the constraints (15, 16, 17) should be transformed
into hinge loss to backpropagate the gradient to lower layers of the CNN. Specifically, the
objective function (12) is reformulated as follows:

$$\min_{w} \frac{1}{2} \|w\|^2 + C_1 \sum_{i \in I_{\mathrm{pos}}} h_{\mathrm{pos},i} + C_2 \sum_{i \in I_{\mathrm{neg}}} h_{\mathrm{neg},i}, \tag{18}$$

where $h_{\mathrm{pos},i} = h_{\mathrm{pos},i}(w)$ and
$h_{\mathrm{neg},i} = h_{\mathrm{neg},i}(w)$ are given as:

$$h_{\mathrm{pos},i}(w) = \max\left\{ 0,\; 1 - w^{\top} \phi(x_i, y_i),\; \max_{y \in \mathcal{Y}_i} \left( w^{\top} (\phi(x_i, y) - \phi(x_i, y_i)) + \Delta^{\mathrm{loc}}(y, y_i) \right) \right\}, \tag{19}$$

$$h_{\mathrm{neg},i}(w) = \max\left\{ 0,\; \max_{y \in \mathcal{Y}_i} \left( 1 + w^{\top} \phi(x_i, y) \right) \right\}. \tag{20}$$

Note that we use different $C$ values for positive and negative examples. In experiments,
$C_1 = 2$ and $C_2 = 1$.
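The two hinge terms follow directly from the constraints (15) to (17): the positive term is the largest violation among the classification margin and the localization margins, and the negative term is the largest violation among the proposals of a negative image. A minimal sketch, assuming the classifier scores $w^{\top}\phi$ and the IoU values have already been computed (the function and argument names are hypothetical):

```python
def hinge_pos(score_gt, scores, ious):
    """Hinge loss for a positive example.

    score_gt: classifier score of the ground truth box.
    scores:   classifier scores of the proposed boxes.
    ious:     IoU of each proposed box with the ground truth.
    """
    worst_localization = max(
        (s - score_gt + (1.0 - o) for s, o in zip(scores, ious)),
        default=float("-inf"),
    )
    return max(0.0, 1.0 - score_gt, worst_localization)

def hinge_neg(scores):
    """Hinge loss for a negative example: every proposal should score below -1."""
    worst = max((1.0 + s for s in scores), default=float("-inf"))
    return max(0.0, worst)

# Hypothetical scores for one positive and one negative training image.
h_pos = hinge_pos(score_gt=0.5, scores=[1.5, 0.0], ious=[0.8, 0.1])
h_neg = hinge_neg(scores=[-2.0, -0.5])
```

Only the single most violating proposal contributes to each term, which is why at most one box per example produces a gradient at any given step.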

The structured SVM objective may cause slow convergence in parameter estimation since it
utilizes at most one instance $y$ among a large number of instances in the (restricted)
output space $\mathcal{Y}_i$, whose size varies from a few hundred to a few thousand. To
overcome this issue, we alternately perform a gradient-based parameter estimation and hard
negative data mining that effectively adapts the number of training examples to be evaluated
for updating the parameters (Appendix A2).

For model parameter estimation, we use L-BFGS to first learn parameters of the classification
layer only. We found that this already resulted in a good detection performance. Then, we
optionally use stochastic gradient descent to fine-tune the whole CNN classifier
(Appendix A1).

5. Experimental results 

We applied our proposed methods to standard visual object detection tasks on PASCAL VOC
2007 [12] and 2012 [14]. In all experiments, we consider R-CNNs [18] as baseline models.
Following [18], we used the CNN models pretrained on the ImageNet database [9] with 1,000
object categories [26, 42], and finetuned the whole network using the target database by
replacing the existing softmax classification layer with a new one with a different number of
classes (e.g., 20 classes for VOC 2007 and 2012). We provide the learning details in
Appendix A3. Our implementation is based on the Caffe toolbox [23].

Setting the R-CNN as a baseline method, we compared the detection performance of our proposed
methods: R-CNN with FGS (R-CNN + FGS), R-CNN trained with the structured SVM objective
(R-CNN + StructObj), and their combination (R-CNN + StructObj + FGS). Since our goal is to
localize the bounding boxes more accurately at the object regions, we also consider the IoU
of 0.7 as an evaluation criterion, which only counts the detection results as correct when
the overlap between the predicted bounding box and the ground truth is greater than 70%. This
is more challenging than common practices (e.g., IoU > 0.5), but will be a good indicator for
a better localization of an object bounding box if successful.

Figure 2: mAP at different IoU criteria on the PASCAL VOC 2007 test set using an oracle
detector. We used different numbers of bounding boxes proposed by selective search (SS) [47],
Objectness [2], local random search, and our proposed FGS methods. The average numbers of
bounding boxes used for evaluation are specified.


5.1. FGS efficacy test with oracle detector 

Before reporting the performance of the proposed methods in the R-CNN framework, we
demonstrate the efficacy of the FGS algorithm using an oracle detector. We design a
hypothetical oracle detector whose score function is defined as
$f_{\mathrm{ideal}}(x_i, y) = \mathrm{IoU}(y, y_i)$, where $y_i$ is a ground truth annotation
for an image $x_i$. The score function is ideal in the sense that it outputs high scores for
bounding boxes with high overlap with the ground truth and vice versa, overall achieving 100%
mAP.
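With an oracle score, detection quality is limited only by the proposals: the top-scoring box is simply the proposal closest to the ground truth. A minimal sketch of this idea, with hypothetical function names and boxes in `(u1, v1, u2, v2)` format:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (u1, v1, u2, v2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def oracle_detects(proposals, gt_box, iou_criterion):
    """True if the oracle's top-scoring proposal exceeds the IoU criterion."""
    best = max(proposals, key=lambda y: iou(y, gt_box))  # oracle score is the IoU
    return iou(best, gt_box) > iou_criterion

# Hypothetical proposals that are each shifted by one pixel from the ground truth.
proposals = [(0, 0, 10, 10), (2, 2, 12, 12)]
gt = (1, 1, 11, 11)
hit_at_05 = oracle_detects(proposals, gt, 0.5)
hit_at_07 = oracle_detects(proposals, gt, 0.7)
```

In this toy case the best proposal has an IoU of about 0.68 with the ground truth, so the object counts as detected at the 0.5 criterion but is missed at 0.7 even though the scoring function is perfect.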

We summarize the results in Figure 2. We report the performance on the VOC 2007 test set at
different levels of IoU criteria (0.1, ..., 0.9) for the baseline selective search (SS; “fast
mode” in [47]), selective search with objectness [2] (SS + Objectness), selective search with
extended super-pixel similarity measurements (SS extended) [47], “quality mode” of selective
search (SS quality) [47], local random search,^8 and the proposed FGS method with the
baseline selective search.

For low values of IoU (< 0.3), all methods using the oracle detectors performed almost
perfectly due to the ideal score function. However, we found that the detection performance
with region proposal schemes other than our proposed FGS algorithm starts to break down at
high values of IoU. For example, the performance of the SS, SS + Objectness, SS extended, and
local random search methods, which used around 2,000 to 3,500 bounding boxes per image on
average, significantly dropped at IoU > 0.5. The SS quality method kept pace with the FGS
method until IoU of 0.6,


^8 Like local FGS, local random search first determines the local search regions by NMS.
However, it randomly chooses a fixed number of bounding boxes in those regions rather than
sequentially proposing new boxes based on some informed method.

Model                 BBoxReg  aero   bike   bird   boat   bottle bus    car    cat    chair  cow    table  dog    horse  mbike  person plant  sheep  sofa   train  tv     mAP
R-CNN (AlexNet)       No       64.2   69.7   50.0   41.9   32.0   62.6   71.0   60.7   32.7   58.5   46.5   56.1   60.6   66.8   54.2   31.5   52.8   48.9   57.9   64.7   54.2
R-CNN (VGG)           No       68.5   74.5   61.0   37.9   40.6   69.2   73.7   69.9   37.2   68.6   56.8   70.6   69.0   67.1   59.6   33.4   63.9   58.9   62.6   68.5   60.6
+ StructObj           No       68.7   73.5   62.6   40.6   41.5   69.6   73.5   71.1   39.9   69.6   58.1   70.0   67.5   69.8   59.8   35.9   63.6   59.0   62.6   67.7   61.2
+ StructObj-FT        No       69.3   75.2   62.2   39.4   42.3   70.7   74.5   74.3   40.4   71.3   59.8   72.0   69.8   69.4   60.3   35.3   64.5   62.0   63.7   69.8   62.3
+ FGS                 No       70.6   78.4   65.7   46.2   48.8   74.6   77.0   74.3   42.7   70.8   60.9   75.1   75.8   70.7   66.3   37.1   66.3   57.6   66.6   71.0   64.8
+ StructObj + FGS     No       73.4   80.9   64.5   46.7   49.1   73.9   78.2   76.8   44.8   75.3   63.0   75.3   74.2   72.7   68.5   37.0   67.5   58.1   66.9   70.5   65.9
+ StructObj-FT + FGS  No       72.5   78.8   67.0   45.2   51.0   73.8   78.7   78.3   46.7   73.8   61.5   77.1   76.4   73.9   66.5   39.2   69.7   59.4   66.8   72.9   66.5
R-CNN (AlexNet)       Yes      68.1   72.8   56.8   43.0   36.8   66.3   74.2   67.6   34.4   63.5   54.5   61.2   69.1   68.6   58.7   33.4   62.9   51.1   62.5   64.8   58.5
R-CNN (VGG)           Yes      70.8   77.1   69.4   45.8   48.4   74.0   77.0   75.0   42.2   72.5   61.5   75.6   77.7   66.6   65.3   39.1   65.8   64.2   68.6   71.5   65.4
+ StructObj           Yes      73.1   77.5   69.2   47.6   47.6   74.5   78.2   75.4   44.5   76.3   64.9   76.7   76.3   69.9   68.1   39.4   67.0   65.6   68.7   70.9   66.6
+ StructObj-FT        Yes      72.6   79.4   69.4   45.2   47.8   74.4   77.8   76.5   45.4   76.3   61.4   80.2   77.1   73.8   66.8   41.1   67.8   64.7   67.9   72.3   66.9
+ FGS                 Yes      74.2   78.9   67.8   51.6   52.3   75.7   78.7   76.6   45.4   72.4   63.1   76.6   79.3   70.7   68.0   40.3   67.8   61.8   70.2   71.6   67.2
+ StructObj + FGS     Yes      74.1   83.2   67.0   50.8   51.6   76.2   81.4   77.2   48.1   78.9   65.6   77.3   78.4   75.1   70.1   41.4   69.6   60.8   70.2   73.7   68.5
+ StructObj-FT + FGS  Yes      71.3   80.5   69.3   49.6   54.2   75.4   80.7   79.4   49.1   76.0   65.2   79.4   78.4   75.0   68.4   41.6   71.3   61.2   68.2   73.3   68.4

Table 1: Test set mAP of VOC 2007 with IoU > 0.5. The entries with the best APs for each object category are bold-faced.


Model                 BBoxReg  aero   bike   bird   boat   bottle bus    car    cat    chair  cow    table  dog    horse  mbike  person plant  sheep  sofa   train  tv     mAP
R-CNN (AlexNet)       No       32.9   40.1   19.7   18.7   11.1   39.4   40.5   26.5   14.8   29.8   24.5   26.4   23.7   31.9   18.5   13.3   27.6   25.8   26.6   39.5   26.6
R-CNN (VGG)           No       40.2   43.3   23.4   14.4   13.3   48.2   44.5   36.4   17.1   34.0   27.9   36.3   26.8   28.2   21.2   10.3   33.7   36.6   31.6   48.9   30.8
+ StructObj           No       42.5   44.4   24.5   17.8   15.3   46.8   46.4   37.9   17.6   33.4   26.6   36.8   24.3   31.5   21.3   10.4   30.0   36.1   30.6   46.3   31.0
+ StructObj-FT        No       44.1   47.1   23.4   16.6   16.4   50.1   48.7   39.7   18.4   39.4   28.6   38.6   27.5   32.4   23.6   11.1   33.1   41.0   34.3   49.6   33.2
+ FGS                 No       44.3   55.5   28.9   19.1   22.9   56.9   57.6   37.8   19.6   35.7   31.9   38.1   43.0   42.7   30.3   9.8    42.3   33.3   43.4   55.4   37.4
+ StructObj + FGS     No       43.5   56.1   30.9   18.7   24.9   55.2   57.6   38.9   20.7   38.6   28.4   37.7   38.7   46.3   30.9   8.4    37.6   37.0   42.2   51.3   37.2
+ StructObj-FT + FGS  No       46.3   58.1   31.1   21.6   25.8   57.1   58.2   43.5   23.0   46.4   29.0   40.7   40.6   46.3   33.4   10.6   41.3   40.9   45.8   56.3   39.8
R-CNN (AlexNet)       Yes      47.6   48.7   25.3   25.0   17.3   53.4   54.6   36.8   16.7   42.3   31.6   35.8   38.0   41.8   24.5   14.3   38.8   28.9   34.0   49.0   35.2
R-CNN (VGG)           Yes      45.1   48.6   26.0   18.2   21.2   57.2   52.4   37.3   20.1   33.7   31.9   38.8   39.6   36.3   26.5   9.2    37.8   33.4   39.4   50.7   35.2
+ StructObj           Yes      49.4   56.5   36.5   21.3   23.3   61.0   58.1   44.3   20.8   47.4   33.3   39.8   40.7   45.9   31.0   14.7   39.6   42.9   45.7   56.9   40.5
+ StructObj-FT        Yes      49.3   58.1   35.4   23.3   24.4   62.3   60.1   45.8   21.8   48.7   32.4   41.8   43.2   45.7   32.0   14.4   44.6   45.1   48.6   59.8   41.8
+ FGS                 Yes      50.9   59.8   34.4   20.9   31.6   66.1   62.3   44.9   22.0   46.5   36.8   42.5   51.4   46.8   34.1   13.5   44.7   39.1   48.9   57.7   42.7
+ StructObj + FGS     Yes      53.6   60.7   32.1   19.9   31.3   63.2   63.2   46.4   23.6   53.0   34.9   40.4   53.6   49.9   34.6   10.2   42.2   40.1   48.3   58.3   43.0
+ StructObj-FT + FGS  Yes      47.1   61.8   35.2   18.1   29.7   66.0   64.7   48.0   25.3   50.4   34.9   43.7   50.8   49.4   36.8   13.7   44.7   43.6   49.8   60.5   43.7

Table 2: Test set mAP of VOC 2007 with IoU > 0.7. The entries with the best APs for each object category are bold-faced.


but again, the performance started to drop at IoU > 0.7.

On the other hand, the performance of FGS dropped 5%
in mAP at IoU of 0.9 by only introducing approximately
100 new bounding boxes per image. Given that SS quality
requires 10,000 region proposals per image, our proposed
FGS method is much more computationally efficient
(~80% fewer bounding boxes) while localizing the bounding
boxes much more accurately. This provides an insight
that, if the detector is accurate, our Bayesian optimization
framework would limit the number of bounding boxes to a
manageable number (e.g., a few thousand per image on
average) to achieve almost perfect detection results.

We also report similar experimental analysis for the real 
detector trained with the proposed structured objective in 
Appendix A6. 

5.2. PASCAL VOC 2007 

In this section, we demonstrate the performance of our
proposed methods on the PASCAL VOC 2007 [12] detection
task (comp4), a standard benchmark for object detection.
Similarly to the training pipeline of R-CNN [18],
we finetuned the CNN models (with softmax classification
layer) pretrained on the ImageNet database using images from
both train and validation sets of VOC 2007 and further
trained the network with linear SVM (baseline) or the proposed
structured SVM objective. We evaluated on the test
set using the proposed FGS algorithm. For post-processing,
we performed NMS and bounding box regression [18].
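The NMS post-processing step mentioned above can be illustrated with a minimal sketch of greedy non-maximum suppression. This is not the authors' implementation: the function names `iou` and `nms`, the box format (x1, y1, x2, y2), and the suppression threshold of 0.3 are assumptions made for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: return indices of the kept boxes, highest score first.

    The threshold value is a hypothetical default, not taken from the paper.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # A box survives only if it does not overlap too much with any
        # higher-scoring box that has already been kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this toy input the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the distant third box is kept.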

Figure 3 shows representative examples of successful detection
using our method. For these cases, our method can
localize objects accurately even if the initial bounding box
proposals do not have good overlaps with the ground truth.
We show more examples (including the failure cases) in
Appendix A9, A10, A11.

The summary results are in Table 1 with IoU criteria of
0.5 and Table 2 with 0.7. We report the performance with
the AlexNet [26] and the VGGNet (16 layers) [42], a deeper
CNN model than AlexNet that showed a significantly better
recognition performance and achieved the best performance
on object localization task in ILSVRC 2014.^ First
of all, we observed the significant performance improvement
by simply having a better CNN model. Building upon
the VGGNet, the FGS improved the performance by 4.2%
and 1.8% in mAP without and with bounding box regression
(Table 1). It becomes much more significant when
we consider IoU criteria of 0.7 (Table 2), improving upon
the baseline model by 6.6% and 7.5% in mAP without and

^The 16-layer VGGNet can be downloaded from: https://gist.github.com/ksimonyan/211839e770f7b538e2d8.

Model                 BBoxReg  aero   bike   bird   boat   bottle bus    car    cat    chair  cow    table  dog    horse  mbike  person plant  sheep  sofa   train  tv     mAP
R-CNN (AlexNet)       No       68.1   63.8   46.1   29.4   27.9   56.6   57.0   65.9   26.5   48.7   39.5   66.2   57.3   65.4   53.2   26.2   54.5   38.1   50.6   51.6   49.6
R-CNN (VGGNet)        No       76.3   69.8   57.9   40.2   37.2   64.0   63.7   80.2   36.1   63.6   47.3   81.1   71.2   73.8   59.5   30.9   64.2   52.2   62.4   58.7   59.5
R-CNN (AlexNet)       Yes      71.8   65.8   52.0   34.1   32.6   59.6   60.0   69.8   27.6   52.0   41.7   69.6   61.3   68.3   57.8   29.6   57.8   40.9   59.3   54.1   53.3
R-CNN (VGGNet)        Yes      79.2   72.3   62.9   43.7   45.1   67.7   66.7   83.0   39.3   66.2   51.7   82.2   73.2   76.5   64.2   33.7   66.7   56.1   68.3   61.0   63.0
+ StructObj           Yes      80.9   74.8   62.7   42.6   46.2   70.2   68.6   84.0   42.2   68.2   54.1   82.2   74.2   79.8   66.6   39.3   67.6   61.0   71.3   65.2   65.1
+ FGS                 Yes      80.5   73.5   64.1   45.3   48.7   66.5   68.3   82.8   39.8   68.2   52.7   82.1   75.1   76.6   66.3   35.5   66.9   56.8   68.7   61.6   64.0
+ StructObj + FGS     Yes      82.9   76.1   64.1   44.6   49.4   70.3   71.2   84.6   42.7   68.6   55.8   82.7   77.1   79.9   68.7   41.4   69.0   60.0   72.0   66.2   66.4
NIN [30]              -        80.2   73.8   61.9   43.7   43.0   70.3   67.6   80.7   41.9   69.7   51.7   78.2   75.2   76.9   65.1   38.6   68.3   58.0   68.7   63.3   63.8


Table 3: Test set mAP of VOC 2012 with IoU > 0.5. The entries with the best APs for each object category are bold-faced.



Figure 3: Detection examples from PASCAL VOC 2007 test set. Two examples from 20 object categories are shown, with
the ground truth bounding boxes (green), the boxes obtained by baseline R-CNN (VGGNet) (red), and those obtained by the
proposed R-CNN + StructObj + FGS (yellow). The numbers near the bounding boxes denote the IoU with the ground truth.


with bounding box regression. The results demonstrate that 
our FGS algorithm is effective in accurately localizing the 
bounding box of an object. 

Further improvement has been made by training a classifier
with structured SVM objective; we obtained 68.5%
mAP in IoU criteria of 0.5, which, to our knowledge, is
higher than the best published results, and 43.0% mAP in
IoU criteria of 0.7 with FGS and bounding box regression
by training the classification layer only (“StructObj”). By
finetuning the whole CNN classifiers (“StructObj-FT”), we
observed extra improvement for most cases; for example,
we obtained 43.7% mAP in IoU criteria of 0.7, which improves
by 0.7% in mAP over the method without finetuning.
However, for the IoU > 0.5 criterion, the overall improvement
due to finetuning was relatively small, especially when using
bounding box regression. In this case, considering the
high computational cost for finetuning, we found that training
only the classification layer is practically a sufficient
way to learn a good localization-aware classifier.

We provide in-depth analysis of our proposed methods in
the appendices. Specifically, we report the precision-recall
curves of different combinations of the proposed methods
(Appendix A7), the performance of FGS with different GP
iterations (Appendix A5), the analysis of localization accuracy
(Appendix A8), and more detection examples.

5.3. PASCAL VOC 2012 

We also evaluate the performance of the proposed methods
on PASCAL VOC 2012 [14]. As the data statistics are
similar to VOC 2007, we used the same hyperparameters as
described in Section 5.2 for this experiment. We report the
test set mAP over 20 object categories in Table 3. Our proposed
method shows improvement by 2.1% with R-CNN +
StructObj and 1.0% with R-CNN + FGS over baseline R-CNN
using VGGNet. Finally, we obtained 66.4% mAP by
combining the two methods, which significantly improved
upon the baseline R-CNN model and the previously published
results on the leaderboard.

6. Conclusion 

In this work, we proposed two complementary methods
to improve the performance of object detection in the R-CNN
framework with 1) a fine-grained search algorithm in a
Bayesian optimization framework to refine the region proposals
and 2) a CNN classifier trained with structured SVM
objective to improve localization. We demonstrated
state-of-the-art detection performance on PASCAL VOC
2007 and 2012 benchmarks under standard localization requirements.
Our methods showed more significant improvement
with higher IoU evaluation criteria (e.g., IoU
= 0.7), and hold promise for mission-critical applications
that require highly precise localization, such as autonomous
driving, robotic surgery and manipulation.

A1. Parameter estimation for finetuning with structured SVM objective

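The model parameters are updated via gradient descent. The gradient, for example, with respect to the CNN parameters
$\theta$ ($\neq w$) for positive examples is given as follows:

$$
\frac{\partial \ell_{\mathrm{pos},i}}{\partial \theta} =
\begin{cases}
0, & \ell_{\mathrm{pos},i} = 0 \\
-\, w^\top \dfrac{\partial \phi(x_i, y_i)}{\partial \theta}, & \ell_{\mathrm{pos},i} = 1 - w^\top \phi(x_i, y_i) \\
w^\top \left( \dfrac{\partial \phi(x_i, \hat{y}_i)}{\partial \theta} - \dfrac{\partial \phi(x_i, y_i)}{\partial \theta} \right), & \text{otherwise}
\end{cases}
\tag{A-1}
$$

where $\hat{y}_i = \arg\max_{y} \left( w^\top \phi(x_i, y) + \Delta^{\mathrm{loc}}(y, y_i) \right)$. Similarly, the gradient for negative
examples can be computed as follows:

$$
\frac{\partial \ell_{\mathrm{neg},i}}{\partial \theta} =
\begin{cases}
0, & \ell_{\mathrm{neg},i} = 0 \\
w^\top \dfrac{\partial \phi(x_i, \hat{y}_i)}{\partial \theta}, & \ell_{\mathrm{neg},i} = 1 + w^\top \phi(x_i, \hat{y}_i)
\end{cases}
\tag{A-2}
$$

where $\hat{y}_i = \arg\max_{y} w^\top \phi(x_i, y)$. The gradient with respect to the parameters of all layers of CNN can be computed
efficiently using backpropagation. When finetuning the entire network, the parameter update in the hard mining procedure
illustrated by Algorithm A-1 is done by replacing $w$ with the CNN parameters.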
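As a rough illustration of the case analysis for positive examples, the following sketch evaluates the loss for one positive example and returns the gradients with respect to the two feature vectors involved, which backpropagation would then carry into the CNN parameters. It assumes the per-example loss is the maximum of zero, the margin term, and the loss-augmented term, which is consistent with the three cases above but is not stated in this section. The function names and the list-based feature representation are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def positive_loss_and_feature_grads(w, phi_gt, candidates):
    """Loss and feature gradients for one positive example (illustrative only).

    w          : classifier weights
    phi_gt     : features of the ground-truth box, phi(x_i, y_i)
    candidates : list of (phi, delta_loc) pairs, one per candidate box y
    Returns (loss, d loss / d phi_gt, d loss / d phi_hat, index of y_hat or None).
    """
    s_gt = dot(w, phi_gt)
    margin_term = 1.0 - s_gt
    # Loss-augmented inference: y_hat maximizes w^T phi(x_i, y) + delta_loc(y, y_i).
    k_hat = max(range(len(candidates)),
                key=lambda k: dot(w, candidates[k][0]) + candidates[k][1])
    phi_hat, delta_hat = candidates[k_hat]
    struct_term = dot(w, phi_hat) + delta_hat - s_gt
    loss = max(0.0, margin_term, struct_term)
    zeros = [0.0] * len(w)
    if loss == 0.0:
        # First case: the example is inactive and contributes no gradient.
        return loss, zeros, zeros, None
    if margin_term >= struct_term:
        # Second case: only the ground-truth score is pushed up.
        return loss, [-wi for wi in w], zeros, None
    # Third case: push the ground-truth score up and the score of y_hat down.
    return loss, [-wi for wi in w], list(w), k_hat


w = [1.0, 0.0]
inactive = positive_loss_and_feature_grads(w, [2.0, 0.0], [([0.5, 0.0], 0.3)])
margin = positive_loss_and_feature_grads(w, [0.2, 0.0], [([0.1, 0.0], 0.1)])
structured = positive_loss_and_feature_grads(w, [0.5, 0.0], [([1.0, 0.0], 0.4)])
```

The three calls exercise the inactive, margin, and structured cases in turn.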
A2. Details on hard negative data mining

The active set consisting of the hard training instances is updated in two steps during the iterative learning process. First,
we include instances $y \in \mathcal{Y}$ in the active set when they are likely to be active, i.e., affect the gradient:

$$ w^\top \left( \phi(x_i, y) - \phi(x_i, y_i) \right) + \Delta^{\mathrm{loc}}(y, y_i) > \max\{0,\, 1 - w^\top \phi(x_i, y_i)\} - \epsilon_1, \quad \forall i \in I_{\mathrm{pos}} \tag{A-3} $$

$$ 1 + w^\top \phi(x_i, y) > -\epsilon_1, \quad \forall i \in I_{\mathrm{neg}} \tag{A-4} $$

Second, once new parameters are estimated, we exclude instances from the current active set when they are likely to be
inactive, i.e., have no effect on the gradient:

$$ w^\top \left( \phi(x_i, y) - \phi(x_i, y_i) \right) + \Delta^{\mathrm{loc}}(y, y_i) < \min\{0,\, 1 - w^\top \phi(x_i, y_i)\} - \epsilon_2, \quad \forall i \in I_{\mathrm{pos}} \tag{A-5} $$

$$ 1 + w^\top \phi(x_i, y) < -\epsilon_2, \quad \forall i \in I_{\mathrm{neg}} \tag{A-6} $$

In our experiments, we used $\epsilon_1 = 0.0001$ and $\epsilon_2 = 0.2$. The values of $\epsilon_1$, $\epsilon_2$ are the same as those for the SVM training
in R-CNN [18]. We did not observe a noticeable performance fluctuation due to different $\epsilon_1$, $\epsilon_2$ values. Algorithm A-1
summarizes the hard-mining procedure.
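The inclusion and exclusion tests (A-3) to (A-6) can be written as four small predicates on precomputed scores. This is a sketch under assumptions: `s_y` stands for $w^\top \phi(x_i, y)$, `s_gt` for $w^\top \phi(x_i, y_i)$, and the function names are hypothetical; only the default values of the two tolerances come from the text.

```python
def should_add_positive(s_y, s_gt, delta_loc, eps1=1e-4):
    """(A-3): candidate y of a positive image is likely to affect the gradient."""
    return (s_y - s_gt) + delta_loc > max(0.0, 1.0 - s_gt) - eps1


def should_add_negative(s_y, eps1=1e-4):
    """(A-4): candidate y of a negative image is likely to affect the gradient."""
    return 1.0 + s_y > -eps1


def should_drop_positive(s_y, s_gt, delta_loc, eps2=0.2):
    """(A-5): candidate y of a positive image is likely to be inactive."""
    return (s_y - s_gt) + delta_loc < min(0.0, 1.0 - s_gt) - eps2


def should_drop_negative(s_y, eps2=0.2):
    """(A-6): candidate y of a negative image is likely to be inactive."""
    return 1.0 + s_y < -eps2
```

Because the tolerance for removal is larger than the one for inclusion, a negative instance with a score between -1.2 and -1.0001 is neither added nor dropped, which keeps instances near the margin from oscillating in and out of the active set.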


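Algorithm A-1 Parameter estimation with hard mining

Require: Initial parameters $w_0$, maximum epoch number $\mathrm{epoch}_{\max}$, training images $\{x_i, y_i\}_{i=1}^{M}$, positive and negative index
sets $I_{\mathrm{pos}}$, $I_{\mathrm{neg}}$
Ensure: Final parameters $w$.
 1: The active set $\mathcal{A} \leftarrow \{(x_i, y_i) : i \in I_{\mathrm{pos}}\}$
 2: $\mathcal{A}_{\mathrm{inc}} \leftarrow \emptyset$, $w \leftarrow w_0$
 3: for epoch = 1, ..., $\mathrm{epoch}_{\max}$ do
 4:     for i = 1, ..., M do
 5:         $\mathcal{A}_{\mathrm{inc}} \leftarrow \mathcal{A}_{\mathrm{inc}} \cup \{(y, x_i) : y$ s.t. (A-3), (A-4) for $x_i, w\}$
 6:         if $|\mathcal{A}_{\mathrm{inc}}| >$ update_threshold or i == M then
 7:             $\mathcal{A} \leftarrow \mathcal{A} \cup \mathcal{A}_{\mathrm{inc}}$, $\mathcal{A}_{\mathrm{inc}} \leftarrow \emptyset$
 8:             update the classifier/network parameters $w$ on $\mathcal{A}$
 9:             $\mathcal{A} \leftarrow \mathcal{A} \setminus \{(y, x) \in \mathcal{A} : y$ s.t. (A-5), (A-6) for $x, w\}$
10:         end if
11:     end for
12: end for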
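The control flow of Algorithm A-1 can be sketched in a few lines. The sketch below is deliberately simplified and is not the authors' code: it handles negative images only (so only tests (A-4) and (A-6) appear and the active set starts empty rather than with the positives), and the callables `score` and `fit` are hypothetical stand-ins for $w^\top \phi(x, y)$ and for the L-BFGS or SGD update. Skipping candidates that are already stored is also an added assumption.

```python
def hard_mining(images, w, score, fit, eps1=1e-4, eps2=0.2,
                update_threshold=5000, max_epoch=2):
    """Simplified hard mining loop over negative images.

    images : list of lists of candidate boxes (any comparable objects)
    score  : callable (w, box) -> float, stands in for w^T phi(x, y)
    fit    : callable (w, active) -> new w, stands in for the parameter update
    """
    active, pending = [], []
    for _ in range(max_epoch):
        for i, boxes in enumerate(images):
            # Step 5: collect candidates that satisfy (A-4) under the current w.
            for b in boxes:
                if (1.0 + score(w, b) > -eps1
                        and b not in active and b not in pending):
                    pending.append(b)
            is_last_image = (i == len(images) - 1)
            if len(pending) > update_threshold or is_last_image:
                # Steps 7 and 8: merge the new instances, then re-estimate w.
                active += pending
                pending = []
                w = fit(w, active)
                # Step 9: prune instances that became inactive, per (A-6).
                active = [b for b in active if not (1.0 + score(w, b) < -eps2)]
    return w, active


toy_images = [[-2.0, 0.5], [1.0]]
toy_score = lambda w, b: w * b
w_same, active_same = hard_mining(toy_images, 1.0, toy_score,
                                  lambda w, a: w, max_epoch=1)
w_new, active_new = hard_mining(toy_images, 1.0, toy_score,
                                lambda w, a: -4.0, max_epoch=1)
```

In the toy run, boxes are scalars scored by multiplication; with an identity update the two hard candidates stay active, and with an update that drives their scores well below -1 they are pruned again.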


A3. Implementation details on model parameter estimation 

For our experiments on PASCAL VOC 2007 and VOC 2012, we first finetune the CNN pretrained on ImageNet by
stochastic gradient descent with a 21-way softmax classification layer, where 20 ways are for the 20 object categories of
interest, and the remaining one is for the background. In this step, the SGD learning rate starts at 0.0003, and decreases by 0.3
every 15000 iterations with a mini-batch size of 48. We set the momentum to 0.9, and the weight decay to 0.0005 for all the
layers.
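One plausible reading of this schedule is a step decay in which the learning rate is multiplied by 0.3 every 15000 iterations. That multiplicative interpretation is an assumption, as are the function and parameter names in the sketch below; only the numerical values come from the text.

```python
def step_learning_rate(iteration, base_lr=0.0003, gamma=0.3, step_size=15000):
    """Step decay: multiply the rate by `gamma` once per `step_size` iterations.

    Assumes "decreases by 0.3" means a multiplicative factor of 0.3.
    """
    return base_lr * gamma ** (iteration // step_size)


rates = [step_learning_rate(it) for it in (0, 14999, 15000, 30000)]
```

Under this reading the rate is 3e-4 for the first 15000 iterations, 9e-5 for the next 15000, and 2.7e-5 after that.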

After that, we replace the softmax loss layer with a 20-way structured loss layer, where each way is a binary classifier, and
the hinge losses for the different categories are simply summed up.

For classification-layer-only learning, L-BFGS is adopted as a batch gradient method for the single layer. Each category has
an associated active set for hard negative mining. The classifier update happens independently for each category when 5000
(update_threshold in Algorithm A-1) new hard examples are added to the active set. It is worth mentioning that, in the
beginning of the hard negative mining, significantly more positive images are present than negative images, resulting in
a serious imbalance of the active training samples. As a heuristic to avoid this problem, we limit the number of positive images
to the number of the active negative images when the classifier update happens in the first epoch. We run the hard negative mining
for 2 epochs in total. The first epoch is for initializing the active set with the above heuristic, and the rest is for learning with
all the training data. Compared to the linear SVM training in R-CNN [18], our L-BFGS based solution to the structured
objective takes 8 to 10× longer. However, it turns out to be significantly more efficient than [24].

For the entire network finetuning, we initialize the structured loss layer with the weights obtained by the classification-layer-only
learning. The whole network is then finetuned by backpropagating the gradient from the top layer with a fixed
SGD learning rate of 10“^. For implementation simplicity, we keep updating the active sets until the end of an epoch, and
update the classifiers per epoch (i.e., update_threshold = +∞ in Algorithm A-1). Like before, each category still has
one particular active set. However, the network parameters (except for the classifier) are shared across all the categories so
that the feature extraction time is not scaled up with the number of categories. In practice, we found one epoch was enough
for both hard negative mining and SGD in the entire network finetuning case. Running more epochs did not make a noticeable
improvement on the final detection performance on PASCAL VOC 2007 test set, but cost a significantly larger amount of
training time.


A4. Efficiency of fine-grained search (FGS) 

In this section, we provide more details on the local FGS presented in Algorithm 1 of the main text. 

GPR practical efficiency: For the initial proposals given by selective search, $|D_{\mathrm{local}}|$ usually turns out to be 20 to 100,
and lines 9 and 10 can be efficiently solved in around 9 and 6 L-BFGS [37] iterations for StructObj, respectively.

GPU parallelism for CNN: One image can have multiple search regions (e.g., line 6), and 20 object categories together
yield more regions. FGS proposes one extra box per iteration for every search region. These boxes are fed into the CNN
together to utilize GPU parallelism. For VGGNet, we use a batch size of 8 for computing the CNN features within the FGS
procedure.
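The batching described here amounts to pooling the newly proposed boxes from all search regions and splitting them into fixed-size groups for the forward pass. A minimal sketch follows, with a hypothetical helper name and a stand-in list of box identifiers in place of real proposals; only the batch size of 8 comes from the text.

```python
def batches(items, batch_size=8):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# One new box per search region in this iteration (20 regions in this toy case).
new_boxes = list(range(20))
batch_sizes = [len(batch) for batch in batches(new_boxes)]
```

With 20 pooled boxes this gives two full batches and one partial batch of 4.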

Time overhead: For PASCAL VOC 2007, FGS ($t_{\max} = 8$) induced only ~15% total overhead compared to the initial time
cost, which mostly consists of CNN feature extraction from bounding boxes proposed by selective search (SS). Specifically,
1/3 of the overhead is caused by CNN feature extraction from the newly proposed boxes (line 11); the rest is caused by GPR
(lines 9 and 10), NMS (line 5), and pruning (line 12). Each GP iteration (lines 2-16) accounts for ~2% with respect to the initial time
cost, and < 8 GP iterations were sufficient for convergence. Figure A-1 shows the trends of the accumulated time overhead
introduced by FGS per iteration. The time overhead due to FGS may vary with different datasets (e.g., VOC 2012), but in
general, it is < 20% compared to the initial time cost.


[Figure: accumulated time overhead, with the ratio in parentheses, plotted against the maximum FGS iteration number ($t_{\max}$).]

Figure A-1: Time cost with different maximum number of iterations. The “ratio” is with respect to the time cost of extracting
features for the initial bounding boxes.


A5. Step-wise performance of fine-grained search (FGS)

We evaluated the mAP at each GP iteration using R-CNN(VGG)+StructObj+FGS+BBoxReg. The mAPs from 0 to 8 GP
iterations are reported in Table A-1. The mAP increases rapidly in the first 4 iterations, and becomes stable in the following
iterations.

# GP iter  0      1      2      3      4      5      6      7      8
mAP        66.6   67.5   67.8   68.2   68.3   68.6   68.4   68.6   68.5

Table A-1: Test set mAPs on PASCAL VOC 2007 for “R-CNN(VGG)+StructObj+FGS+BBoxReg” with different number
of GP iterations

A6. Test set mAP on PASCAL VOC 2007 using VGGNet with different region proposal methods

In Figure 2 of the paper, we report and compare the test set mAPs on PASCAL VOC 2007 with different region proposal
algorithms (e.g., selective search (SS) [47] at different modes, Objectness [2]) using an oracle detector. In this section, we
performed similar experiments using a real detector trained with structured SVM objective based on VGGNet features. The
summary results are given in Tables A-2 and A-3.


Region proposal methods \ IoU threshold  0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8
SS (~2000 boxes per image)               74.3  73.8  72.5  69.6  61.2  47.4  31.2  15.4
SS + Objectness (~3000 boxes per image)  73.4  72.9  71.7  68.9  60.6  47.0  31.1  15.3
SS extended (~3500 boxes per image)      74.3  73.9  72.8  70.1  62.8  48.5  32.3  15.9
SS quality (~10000 boxes per image)      74.1  73.7  72.7  70.0  63.7  51.4  35.7  17.9
SS + FGS (~2150 boxes per image)         76.6  76.1  75.0  72.4  65.8  54.1  37.2  17.4

Table A-2: Test set mAPs on PASCAL VOC 2007 with different region proposal methods at varying IoU thresholds from 0.1
to 0.8 without bounding box regression.


Region proposal methods \ IoU threshold  0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8
SS (~2000 boxes per image)               77.4  76.9  75.9  73.4  66.5  55.8  40.9  19.4
SS + Objectness (~3000 boxes per image)  76.9  76.5  75.5  73.0  66.0  55.2  40.4  19.3
SS extended (~3500 boxes per image)      77.6  77.2  76.2  73.8  67.6  57.0  41.8  19.7
SS quality (~10000 boxes per image)      77.1  76.7  75.8  73.3  67.1  57.0  42.3  19.4
SS + FGS (~2150 boxes per image)         78.3  77.8  76.8  74.4  68.5  57.9  43.1  20.4

Table A-3: Test set mAPs on PASCAL VOC 2007 with different region proposal methods at varying IoU thresholds from 0.1
to 0.8 with bounding box regression.


For both cases (with and without bounding box regression), the FGS showed improved performance over other region
proposal methods using a smaller number of region proposals. In particular, the SS + FGS method (row 5 in Tables A-2 and A-3)
even outperformed the SS “quality” mode [47], which requires ~5× more computational expense than our proposed
method to compute CNN-based detection scores for bounding box proposals.

Although the current state-of-the-art CNN-based detector outperforms other object detection methods [15, 33, 5] by a
large margin, there still remains a significant gap with that of the hypothetical oracle detector. This motivates further
research on improving the quality of the CNNs for better visual object recognition performance.


A7. Precision-recall curves on PASCAL VOC 2007 

In this section, we present the precision-recall curves for four different models. Specifically, we show results for VGGNet,
VGGNet trained with structured SVM objective (VGGNet + StructObj), VGGNet with FGS (VGGNet + FGS), and VGGNet
with both (VGGNet + StructObj + FGS) in Figure A-2. In general, the improvement from the structured SVM objective is
more significant for the high recall range (i.e., recall > 0.5) than the low recall range, except for the “sheep” class. FGS usually
improves the precision for most object categories.

Figure A-2: Category-wise precision-recall curves for four different models (VGGNet, + StructObj, + FGS, + StructObj +
FGS) on PASCAL VOC 2007 database.

A8. Localization accuracy on PASCAL VOC 2007

In this section, we analyze the localization behavior of different methods. We find the predicted bounding boxes that most
accurately localize the ground truth from each method by picking the detected box with the highest overlap (i.e., IoU) with
respect to each ground truth bounding box. Comparisons between the different methods are performed by estimating the
distribution of the overlaps for every category. Our findings are shown in Figure A-3 (without BBoxReg) and Figure A-4
(with BBoxReg + the baseline without BBoxReg). Methods with better localization ability should have higher frequency
at higher IoU (i.e., IoU between 0.6 and 0.9) and lower frequency at lower IoU (i.e., IoU between 0 and 0.4). Curve peaks
leaning to the right signify better localization.
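The measurement described above (best-overlapping detection per ground truth box, then a distribution over IoU) can be sketched as follows. The box format (x1, y1, x2, y2), the function names, and the choice of ten equal-width bins are assumptions for illustration, not details taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_iou_histogram(gt_boxes, detections, num_bins=10):
    """Relative frequency of the best IoU per ground truth box, binned on [0, 1]."""
    counts = [0] * num_bins
    for gt in gt_boxes:
        # Highest overlap achieved by any detection; 0 if nothing was detected.
        best = max((iou(gt, d) for d in detections), default=0.0)
        # An IoU of exactly 1.0 falls into the last bin.
        counts[min(int(best * num_bins), num_bins - 1)] += 1
    total = len(gt_boxes)
    return [c / total for c in counts] if total else [0.0] * num_bins


hist_exact = best_iou_histogram([(0, 0, 10, 10)], [(0, 0, 10, 10), (5, 5, 15, 15)])
hist_half = best_iou_histogram([(0, 0, 10, 10)], [(0, 0, 10, 5)])
```

A method with better localization would place more of its mass in the right-hand bins of such a histogram.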

As shown in Figures A-3 and A-4, individually applying FGS and StructObj results in better localization compared to the
baseline R-CNN, regardless of using or not using bounding box regression. However, the performance improvement due to
FGS and StructObj independently results in different phenomena. In general, we found that FGS pushes the distribution peak
to the right, and StructObj pulls the distribution peak higher while pushing the frequencies in the low IoU interval down. This
indicates that FGS can propose more accurately localized boxes if the original set of bounding boxes is reasonably well
localized, and StructObj can make detection scores more accurate (i.e., give low detection scores to boxes with low overlap
and high detection scores to boxes with high overlap) based on the overlap of the proposed bounding boxes with respect to the
ground truth. Combining FGS and StructObj together capitalizes on the advantages of both, and leads to the best localization
accuracy.

A9. Examples with the largest improvement on PASCAL VOC 2007 test set 

In this section, we show examples with the largest improvement in localization accuracy using our best proposed method
(VGGNet + StructObj + FGS) over the baseline detection method (VGGNet) from PASCAL VOC 2007 test set. For each
example in Figure A-5, we show the category of interest on the left-bottom corner of the image, and draw the detection of
our best proposed method (in yellow box) that is best matched with the particular ground truth (in green box) and the best
matched detection of the baseline (in red box). The number on the top right of the detected bounding box denotes the IoU
with the ground truth.

A10. Top-ranked false positives on PASCAL VOC 2007 test set

In this section, we show examples of the top-ranked false positive detections^^ (in green box) of our best proposed method
(VGGNet + StructObj + FGS) from PASCAL VOC 2007 test set (Figure A-6). We categorize the false positives into four
categories as in [22]:

• loc: poor localization, 

• sim: confusion with similar objects, 

• oth: confusion with other objects, 

• bg: confusion with background or unlabeled objects. 

The overlap (ov) measured by the IoU between the false positive detection and its best matching ground truth bounding box
is provided. For “loc” examples, the closest bounding box annotated with the same object category is provided as a ground
truth (in yellow box). For “sim” or “oth” examples, the closest bounding box annotated with any object category is provided
as a ground truth (in red box).

A11. Random detection examples on PASCAL VOC 2007 test set

Finally, we show randomly selected detection examples of our best proposed method (VGGNet + StructObj + FGS) from
PASCAL VOC 2007 test set. In Figure A-7, we use bounding boxes with different colors for different categories. The
category label with detection score is displayed on the top-left corner of each bounding box. Detections with low scores are
ignored.


^^The top-ranked false positives are selected among false positive bounding boxes with the highest detection scores.

Figure A-3: Category-wise localization accuracy (in terms of the IoU between a ground truth annotation and its closest
detected box) distributions for four different models (VGGNet without BBoxReg, + StructObj, + FGS, + StructObj + FGS)
on PASCAL VOC 2007 datasets.

Figure A-4: Category-wise localization accuracy (in terms of the IoU between a ground truth annotation and its closest
detected box) distributions for five different models (VGGNet without BBoxReg; VGGNet with BBoxReg, + StructObj,
+ FGS, + StructObj + FGS) on PASCAL VOC 2007 datasets.

Figure A-5: Examples with the largest improvement with regards to the baseline method on PASCAL VOC 2007 test set.
Refer to the text for more detail.

Figure A-6: Top-ranked false positives of “VGGNet + StructObj + FGS” on PASCAL VOC 2007 test set.

Figure A-7: Random detection examples of “VGGNet + StructObj + FGS” on PASCAL VOC 2007 test set.